Improving Methods for Single-label Text Categorization

نویسنده

  • Isabel Maria Martins Trancoso
چکیده

As the volume of information in digital form increases, the use of Text Categorization techniques aimed at finding relevant information becomes more necessary. To improve the quality of the classification, I propose the combination of different classification methods. The results show that k-NN-LSI, the combination of k-NNwith LSI, presents an average Accuracy on the five datasets that is higher than the average Accuracy of each original method. The results also show that SVM-LSI, the combination of SVM with LSI, outperforms both original methods in some datasets. Having in mind that SVM is usually the best performing method, it is particularly interesting that SVM-LSI performs even better in some situations. To reduce the number of labeled documents needed to train the classifier, I propose the use of a semi-supervised centroid-based method that uses information from small volumes of labeled data together with information from larger volumes of unlabeled data for text categorization. Using one synthetic dataset and three real-world datasets, I provide empirical evidence that, if the initial classifier for the data is sufficiently precise, using unlabeled data improves performance. On the other hand, using unlabeled data actually degrades the results if the initial classifier is not good enough. The dissertation includes a comprehensive comparison between the classification methods that are most frequently used in the Text Categorization area and the combinations of methods proposed. Palavras-chave Classificação de Texto Conjuntos de Dados Medidas de Avaliação Combinação de Métodos de Classificação Classificação Semi-supervisionada Classificação Incremental

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Comparison of Text Categorization Methods

In this paper firstly I have compared Single Label Text Categorization with Multi Label Text Categorization in detail then I have compared Document Pivoted Categorization with Category Pivoted Categorization in detail. For this purpose I have given the general definition of Text Categorization with its mathematical notation for the purpose of its frugality and cost effectiveness. Then with the ...

متن کامل

Exploiting Associations between Class Labels in Multi-label Classification

Multi-label classification has many applications in the text categorization, biology and medical diagnosis, in which multiple class labels can be assigned to each training instance simultaneously. As it is often the case that there are relationships between the labels, extracting the existing relationships between the labels and taking advantage of them during the training or prediction phases ...

متن کامل

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...

متن کامل

Combining LSI with other Classifiers to Improve Accuracy of Single-label Text Categorization

This paper describes the combination of k-NN and SVM with LSI to improve their performance in single-label text categorization tasks, and the experiments performed with six datasets to show that both k-NN-LSI (the combination of k-NN with LSI) and SVM-LSI (the combination of SVM with LSI) outperform the original methods for a significant fraction of the datasets. Overall, both combinations pres...

متن کامل

Improving Feature Selection Techniques for Machine Learning

As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant or noise features to reduce the dimensionality of feature space. It improves efficiency, accuracy and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applica...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007